Main Processing:
As technology progresses, data-driven approaches to everyday problems have become increasingly important. In this report we take a data-driven approach to street-level safety in the city of Colchester, using two datasets. By analysing the patterns in, and correlations between, various crimes and the weather conditions, we aim to identify patterns that could inform safety measures and help police prioritise the cases that deserve focused effort, rather than inadvertently directing resources towards less significant cases.
#Libraries
library(stringr)
library(knitr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(DT)
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
library(ggcorrplot)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:plotly':
##
## select
library(leaflet)
library(leaflet.extras)
#set working directory
getwd()
## [1] "/Users/praj/Documents/Lectures/Data Viz"
setwd("/Users/praj/Documents/Lectures/Data Viz")
#reading the data
rm(list=ls())
initial_crime <- read.csv("crime24.csv")
initial_temp <- read.csv("temp24.csv")
#View the structure and dimension of data:
dim(initial_crime)
## [1] 6304 13
dim(initial_temp)
## [1] 366 18
str(initial_crime)
## 'data.frame': 6304 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ category : chr "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
## $ persistent_id : chr "" "" "" "" ...
## $ date : chr "2024-01" "2024-01" "2024-01" "2024-01" ...
## $ lat : num 51.9 51.9 51.9 51.9 51.9 ...
## $ long : num 0.901 0.899 0.902 0.888 0.89 ...
## $ street_id : int 2153130 2153105 2153147 2152856 2152871 2153107 2152963 2152963 2153186 2153163 ...
## $ street_name : chr "On or near Middle Mill" "On or near Conference/exhibition Centre" "On or near Mason Road" "On or near Kensington Road" ...
## $ context : logi NA NA NA NA NA NA ...
## $ id : int 115967607 115967129 115967591 115967062 115967058 115967547 115967516 115967638 115967128 115967378 ...
## $ location_type : chr "Force" "Force" "Force" "Force" ...
## $ location_subtype: chr "" "" "" "" ...
## $ outcome_status : chr NA NA NA NA ...
str(initial_temp)
## 'data.frame': 366 obs. of 18 variables:
## $ station_ID : int 3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
## $ Date : chr "2024-12-31" "2024-12-30" "2024-12-29" "2024-12-28" ...
## $ TemperatureCAvg: num 6.5 5.6 3.3 4 5.3 6.7 9.4 4.3 4.6 7.2 ...
## $ TemperatureCMax: num 7.7 6.9 4.9 5.8 6.7 10 12.3 6.9 7.9 11 ...
## $ TemperatureCMin: num 5 3.4 2.2 2.3 4.3 5.6 3.5 2.5 2.5 3.3 ...
## $ TdAvgC : num 4.4 4.9 3.2 3.7 5.1 6.4 8.8 1.8 -0.5 4.5 ...
## $ HrAvg : num 86.4 94.9 98.6 98.4 98.4 98.3 95.6 84.2 70 83 ...
## $ WindkmhDir : chr "WSW" "WSW" "W" "SW" ...
## $ WindkmhInt : num 22.7 16.7 11.4 5.5 6.3 9.3 15.4 16.4 36.8 28 ...
## $ WindkmhGust : num 42.6 40.8 22.2 14.8 16.7 22.2 31.5 50 70.4 66.7 ...
## $ PresslevHp : num 1025 1028 1028 1032 1035 ...
## $ Precmm : num 0 0 0.4 0.4 0.4 0.4 0 0 0.8 0.8 ...
## $ TotClOct : num 4.5 8 8 8 8 8 6.8 6.7 4.3 6.6 ...
## $ lowClOct : num 7.2 8 8 8 8 8 6.8 7.6 5.2 6.9 ...
## $ SunD1h : num 5.7 0 0 0 0 0 0 1.4 2.8 0 ...
## $ VisKm : num 63.4 15.3 0.5 0.1 0.5 0.2 13.3 20 38.8 34.9 ...
## $ SnowDepcm : int NA NA NA NA NA NA NA NA NA NA ...
## $ PreselevHp : logi NA NA NA NA NA NA ...
# Check for missing values
sum(is.na(initial_crime))
## [1] 7014
sum(is.na(initial_temp))
## [1] 759
Data Preparation and Cleaning:
Data preparation is an important step in the analytical process, ensuring the integrity and reliability of the data sources. The raw crime dataset underwent a cleaning process to address missing values, standardise formats, and remove irrelevant information. Numeric columns were checked, and missing values were replaced with appropriate substitutes (here, the column mean). Textual data (such as street names) was converted to a consistent format for accurate analysis and visualisation. We removed the context and location sub-type columns as they are either empty or entirely NA.
Similarly, the temp (weather) dataset underwent a cleaning process to ensure its consistency and completeness. Inconsistent variables were identified and removed or replaced, making the dataset fit for analysis.
#Setting new variable for cleaned dataset
cleaned_crime <- initial_crime
#Listing numeric columns
num_col_crime <- sapply(cleaned_crime, is.numeric)
#Replacing NA values with mean
cleaned_crime[num_col_crime] <- lapply(cleaned_crime[num_col_crime], function(x) {
ifelse(is.na(x), round(mean(x, na.rm = TRUE), 1), x)
})
#Data cleaning in cleaned set:
#Filling missing values in outcome_status
cleaned_crime$outcome_status[is.na(cleaned_crime$outcome_status)] <- "No Information available"
cleaned_crime$street_name <- str_trim(str_to_lower(cleaned_crime$street_name))
cleaned_crime$date <- ym(cleaned_crime$date)
#Remove irrelevant data columns
cleaned_crime <- subset(cleaned_crime, select = -c(context, location_subtype))
##Repeating process for temp data
cleaned_temp <- initial_temp
num_col_temp <- sapply(cleaned_temp, is.numeric)
cleaned_temp[num_col_temp] <- lapply( cleaned_temp[num_col_temp],
function(x) {
ifelse(is.na(x), round(mean(x, na.rm = TRUE), 1), x)
}
)
cleaned_temp$Date <- ymd(cleaned_temp$Date)
cleaned_temp <- cleaned_temp[, !names(cleaned_temp) %in% c("PreselevHp", "SnowDepcm")]
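The mean imputation applied to both datasets above can also be written in tidyverse style with `mutate(across(...))`. This is just an equivalent sketch of the `lapply()` approach, not a change to the pipeline; the helper name `impute_numeric_means` is ours:

```r
library(dplyr)

# Equivalent sketch of the mean-imputation step above, in dplyr style:
# every numeric column has its NAs replaced by the (rounded) column mean.
impute_numeric_means <- function(df) {
  df %>%
    mutate(across(where(is.numeric),
                  ~ ifelse(is.na(.x), round(mean(.x, na.rm = TRUE), 1), .x)))
}

# e.g. impute_numeric_means(initial_temp) reproduces the numeric columns
# of cleaned_temp; character and Date columns pass through untouched.
```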
dim(cleaned_crime)
## [1] 6304 11
head(cleaned_crime)
## X category persistent_id date lat long street_id
## 1 1 anti-social-behaviour 2024-01-01 51.89301 0.901028 2153130
## 2 2 anti-social-behaviour 2024-01-01 51.88979 0.898830 2153105
## 3 3 anti-social-behaviour 2024-01-01 51.89825 0.902107 2153147
## 4 4 anti-social-behaviour 2024-01-01 51.87837 0.888373 2152856
## 5 5 anti-social-behaviour 2024-01-01 51.87905 0.889521 2152871
## 6 6 anti-social-behaviour 2024-01-01 51.88860 0.899203 2153107
## street_name id location_type
## 1 on or near middle mill 115967607 Force
## 2 on or near conference/exhibition centre 115967129 Force
## 3 on or near mason road 115967591 Force
## 4 on or near kensington road 115967062 Force
## 5 on or near lambeth road 115967058 Force
## 6 on or near trinity street 115967547 Force
## outcome_status
## 1 No Information available
## 2 No Information available
## 3 No Information available
## 4 No Information available
## 5 No Information available
## 6 No Information available
dim(cleaned_temp)
## [1] 366 16
head(cleaned_temp)
## station_ID Date TemperatureCAvg TemperatureCMax TemperatureCMin TdAvgC
## 1 3590 2024-12-31 6.5 7.7 5.0 4.4
## 2 3590 2024-12-30 5.6 6.9 3.4 4.9
## 3 3590 2024-12-29 3.3 4.9 2.2 3.2
## 4 3590 2024-12-28 4.0 5.8 2.3 3.7
## 5 3590 2024-12-27 5.3 6.7 4.3 5.1
## 6 3590 2024-12-26 6.7 10.0 5.6 6.4
## HrAvg WindkmhDir WindkmhInt WindkmhGust PresslevHp Precmm TotClOct lowClOct
## 1 86.4 WSW 22.7 42.6 1025.3 0.0 4.5 7.2
## 2 94.9 WSW 16.7 40.8 1028.5 0.0 8.0 8.0
## 3 98.6 W 11.4 22.2 1028.5 0.4 8.0 8.0
## 4 98.4 SW 5.5 14.8 1031.8 0.4 8.0 8.0
## 5 98.4 S 6.3 16.7 1034.7 0.4 8.0 8.0
## 6 98.3 WSW 9.3 22.2 1033.6 0.4 8.0 8.0
## SunD1h VisKm
## 1 5.7 63.4
## 2 0.0 15.3
## 3 0.0 0.5
## 4 0.0 0.1
## 5 0.0 0.5
## 6 0.0 0.2
Validating data after cleaning:
The structure and summary of the cleaned data were reviewed, and we confirmed that no NA values remain in either table, since is.na() now returns a count of 0.
#View the structure of the data
str(cleaned_crime)
## 'data.frame': 6304 obs. of 11 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ category : chr "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
## $ persistent_id : chr "" "" "" "" ...
## $ date : Date, format: "2024-01-01" "2024-01-01" ...
## $ lat : num 51.9 51.9 51.9 51.9 51.9 ...
## $ long : num 0.901 0.899 0.902 0.888 0.89 ...
## $ street_id : int 2153130 2153105 2153147 2152856 2152871 2153107 2152963 2152963 2153186 2153163 ...
## $ street_name : chr "on or near middle mill" "on or near conference/exhibition centre" "on or near mason road" "on or near kensington road" ...
## $ id : int 115967607 115967129 115967591 115967062 115967058 115967547 115967516 115967638 115967128 115967378 ...
## $ location_type : chr "Force" "Force" "Force" "Force" ...
## $ outcome_status: chr "No Information available" "No Information available" "No Information available" "No Information available" ...
str(cleaned_temp)
## 'data.frame': 366 obs. of 16 variables:
## $ station_ID : int 3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
## $ Date : Date, format: "2024-12-31" "2024-12-30" ...
## $ TemperatureCAvg: num 6.5 5.6 3.3 4 5.3 6.7 9.4 4.3 4.6 7.2 ...
## $ TemperatureCMax: num 7.7 6.9 4.9 5.8 6.7 10 12.3 6.9 7.9 11 ...
## $ TemperatureCMin: num 5 3.4 2.2 2.3 4.3 5.6 3.5 2.5 2.5 3.3 ...
## $ TdAvgC : num 4.4 4.9 3.2 3.7 5.1 6.4 8.8 1.8 -0.5 4.5 ...
## $ HrAvg : num 86.4 94.9 98.6 98.4 98.4 98.3 95.6 84.2 70 83 ...
## $ WindkmhDir : chr "WSW" "WSW" "W" "SW" ...
## $ WindkmhInt : num 22.7 16.7 11.4 5.5 6.3 9.3 15.4 16.4 36.8 28 ...
## $ WindkmhGust : num 42.6 40.8 22.2 14.8 16.7 22.2 31.5 50 70.4 66.7 ...
## $ PresslevHp : num 1025 1028 1028 1032 1035 ...
## $ Precmm : num 0 0 0.4 0.4 0.4 0.4 0 0 0.8 0.8 ...
## $ TotClOct : num 4.5 8 8 8 8 8 6.8 6.7 4.3 6.6 ...
## $ lowClOct : num 7.2 8 8 8 8 8 6.8 7.6 5.2 6.9 ...
## $ SunD1h : num 5.7 0 0 0 0 0 0 1.4 2.8 0 ...
## $ VisKm : num 63.4 15.3 0.5 0.1 0.5 0.2 13.3 20 38.8 34.9 ...
#View summary of the data
summary(cleaned_crime)
## X category persistent_id date
## Min. : 1 Length:6304 Length:6304 Min. :2024-01-01
## 1st Qu.:1577 Class :character Class :character 1st Qu.:2024-03-01
## Median :3152 Mode :character Mode :character Median :2024-07-01
## Mean :3152 Mean :2024-06-15
## 3rd Qu.:4728 3rd Qu.:2024-09-01
## Max. :6304 Max. :2024-12-01
## lat long street_id street_name
## Min. :51.88 Min. :0.8788 Min. :2152686 Length:6304
## 1st Qu.:51.89 1st Qu.:0.8966 1st Qu.:2153025 Class :character
## Median :51.89 Median :0.9013 Median :2153155 Mode :character
## Mean :51.89 Mean :0.9029 Mean :2153873
## 3rd Qu.:51.89 3rd Qu.:0.9088 3rd Qu.:2153366
## Max. :51.90 Max. :0.9246 Max. :2343256
## id location_type outcome_status
## Min. :115954844 Length:6304 Length:6304
## 1st Qu.:118009952 Class :character Class :character
## Median :120228058 Mode :character Mode :character
## Mean :120403000
## 3rd Qu.:122339060
## Max. :125550731
summary(cleaned_temp)
## station_ID Date TemperatureCAvg TemperatureCMax
## Min. :3590 Min. :2024-01-01 Min. :-2.60 Min. : 1.10
## 1st Qu.:3590 1st Qu.:2024-04-01 1st Qu.: 7.00 1st Qu.:10.72
## Median :3590 Median :2024-07-01 Median :10.95 Median :14.75
## Mean :3590 Mean :2024-07-01 Mean :10.98 Mean :15.08
## 3rd Qu.:3590 3rd Qu.:2024-09-30 3rd Qu.:14.50 3rd Qu.:19.60
## Max. :3590 Max. :2024-12-31 Max. :23.10 Max. :29.80
## TemperatureCMin TdAvgC HrAvg WindkmhDir
## Min. :-6.100 Min. :-6.000 Min. :59.60 Length:366
## 1st Qu.: 3.325 1st Qu.: 4.725 1st Qu.:75.90 Class :character
## Median : 6.800 Median : 8.200 Median :82.75 Mode :character
## Mean : 6.486 Mean : 7.752 Mean :81.74
## 3rd Qu.: 9.500 3rd Qu.:11.000 3rd Qu.:88.80
## Max. :16.700 Max. :16.900 Max. :98.60
## WindkmhInt WindkmhGust PresslevHp Precmm
## Min. : 3.90 Min. : 11.10 Min. : 978.9 Min. : 0.000
## 1st Qu.:12.22 1st Qu.: 31.50 1st Qu.:1007.5 1st Qu.: 0.000
## Median :15.80 Median : 38.90 Median :1013.8 Median : 0.200
## Mean :16.52 Mean : 40.81 Mean :1013.7 Mean : 1.866
## 3rd Qu.:19.80 3rd Qu.: 48.20 3rd Qu.:1021.0 3rd Qu.: 1.900
## Max. :42.50 Max. :105.60 Max. :1037.3 Max. :38.000
## TotClOct lowClOct SunD1h VisKm
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.10
## 1st Qu.:3.800 1st Qu.:5.900 1st Qu.: 0.325 1st Qu.:20.73
## Median :5.600 Median :6.900 Median : 3.500 Median :30.95
## Mean :5.304 Mean :6.609 Mean : 4.203 Mean :31.42
## 3rd Qu.:7.200 3rd Qu.:7.600 3rd Qu.: 7.100 3rd Qu.:41.20
## Max. :8.000 Max. :8.000 Max. :15.600 Max. :71.20
#Check for missing values to reconfirm removal
sum(is.na(cleaned_crime))
## [1] 0
sum(is.na(cleaned_temp))
## [1] 0
Analysing the crime dataset:
We examined each category of crime and its frequency. The dataset contains a total of 14 crime categories. Violent crime has the highest frequency at 2420 records, while possession of weapons has the lowest at 65. We also create a two-way table to analyse the different crimes and their outcomes.
#Creating new table for frequency:
crime_freq <- as.data.frame(table(cleaned_crime$category))
#Originally the columns are named Var1 and Freq, so we rename them
colnames(crime_freq) <- c("Category", "Crime Frequency")
#Find the category with the highest frequency, i.e. the most committed offence
max_freq <- crime_freq[which.max(crime_freq$`Crime Frequency`), ]
max_freq
## Category Crime Frequency
## 14 violent-crime 2420
#Calculating crime percentage
crime_freq <- crime_freq %>%
arrange(desc(`Crime Frequency`)) %>%
mutate(Percentage = round((`Crime Frequency` / sum(`Crime Frequency`)) * 100, 2))
datatable(crime_freq, options = list(pageLength = 10))
#Creating a two-way table for crime category and its outcome
category_outcome <- table(cleaned_crime$category, cleaned_crime$outcome_status)
kable(category_outcome)
| Action to be taken by another organisation | Awaiting court outcome | Court result unavailable | Formal action is not in the public interest | Further action is not in the public interest | Further investigation is not in the public interest | Investigation complete; no suspect identified | Local resolution | No Information available | Offender given a caution | Status update unavailable | Suspect charged as part of another case | Unable to prosecute suspect | Under investigation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| anti-social-behaviour | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 710 | 0 | 0 | 0 | 0 | 0 |
| bicycle-theft | 3 | 3 | 2 | 0 | 0 | 0 | 113 | 0 | 0 | 0 | 2 | 0 | 16 | 10 |
| burglary | 1 | 10 | 6 | 0 | 0 | 0 | 108 | 0 | 0 | 0 | 7 | 0 | 26 | 13 |
| criminal-damage-arson | 6 | 28 | 34 | 0 | 1 | 0 | 246 | 9 | 0 | 11 | 10 | 0 | 110 | 24 |
| drugs | 2 | 13 | 19 | 7 | 4 | 0 | 18 | 118 | 0 | 16 | 9 | 0 | 16 | 43 |
| other-crime | 4 | 7 | 12 | 2 | 1 | 1 | 14 | 1 | 0 | 1 | 5 | 0 | 43 | 9 |
| other-theft | 1 | 6 | 5 | 2 | 0 | 0 | 283 | 1 | 0 | 0 | 12 | 1 | 74 | 27 |
| possession-of-weapons | 2 | 8 | 9 | 0 | 1 | 0 | 6 | 4 | 0 | 4 | 8 | 0 | 16 | 7 |
| public-order | 8 | 20 | 20 | 11 | 0 | 0 | 165 | 12 | 0 | 3 | 27 | 0 | 160 | 32 |
| robbery | 0 | 7 | 4 | 1 | 0 | 0 | 41 | 0 | 0 | 0 | 4 | 0 | 25 | 3 |
| shoplifting | 5 | 82 | 65 | 2 | 0 | 0 | 313 | 25 | 0 | 1 | 7 | 0 | 92 | 37 |
| theft-from-the-person | 2 | 2 | 0 | 0 | 0 | 0 | 66 | 0 | 0 | 0 | 2 | 0 | 14 | 5 |
| vehicle-crime | 1 | 7 | 6 | 0 | 0 | 0 | 208 | 0 | 0 | 0 | 3 | 0 | 33 | 12 |
| violent-crime | 84 | 96 | 107 | 12 | 7 | 1 | 446 | 31 | 0 | 20 | 141 | 0 | 1195 | 280 |
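Row proportions make the table above easier to compare across categories, since the raw counts differ so much between crimes. A quick sketch with base R's `prop.table()`, assuming `category_outcome` as built above:

```r
# Proportion of each outcome within a crime category (each row sums to 1).
outcome_props <- prop.table(category_outcome, margin = 1)

# Share of each outcome for violent crime, rounded for readability.
round(outcome_props["violent-crime", ], 2)
```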
Creating plots for various analyses:
Plotting the data is important for uncovering correlations and patterns. In our case, the plots show that violent crime and anti-social behaviour were the most common crimes in Colchester in 2024. The pie chart shows that 38.4% of all recorded crimes are violent crimes, which warrants further study into why that is. External factors beyond the weather, such as income, employment, and access to mental health support, could be examined to find the root cause.
#pie chart for various crimes category and their percentage
plot_ly(crime_freq, labels = ~Category, values = ~Percentage, type = 'pie',
textinfo = 'label+percent',
insidetextorientation = 'radial',
marker = list(line = list(color = '#FFFFFF', width = 0.5))) %>%
layout(title = "Crime Distribution by Category - Percentages")
By hovering over the bar plot, we can see the exact number of crimes committed in each category. This makes the numbers easier to digest: at a glance the bar heights convey the general scale and severity, and hovering reveals the detailed counts.
In the second bar plot, we can see that the most common places for crimes to happen are on or near supermarkets, which we can attribute to them being crowded places, followed by shopping areas.
#Interactive bar plot for category to frequency analysis
plot_ly(crime_freq, x = ~reorder(Category, -`Crime Frequency`), y = ~`Crime Frequency`, type = 'bar',
marker = list(color = 'lightblue')) %>%
layout(title = "Interactive Crime Frequency by Category", xaxis = list(title = ""),
yaxis = list(title = "Frequency"))
#Bar plot to check the crime distribution in Colchester
top_5_crime_places <- cleaned_crime %>%
count(street_name, sort = TRUE) %>%
slice_max(n, n = 5)
ggplot(top_5_crime_places, aes(x = reorder(street_name, n), y = n)) +
geom_col(fill = "darkred") +
coord_flip() +
labs(title = "Top 5 places where crimes happen in Colchester", x = "Place", y = "Number of Crimes")
For the 2D density plot, the darker colours indicate higher crime density in that area. We can observe quite the contrast at the centre of the plot.
For the most common crimes, which we take to be the top four categories, the most frequent outcomes were as follows: for anti-social behaviour, no information is available for any case; for criminal damage/arson and shoplifting, the most common outcome is "Investigation complete; no suspect identified"; and for violent crime, it is "Unable to prosecute suspect". From these outcomes we can see that the most commonly committed crimes rarely end in severe repercussions that would deter offenders from committing similar crimes. Bar plots are easy to read, and with shortened labels for the various outcomes they are easy to understand as well. Setting distinct colours further improves the readability of the plots.
#2D Density plot for crimes
ggplot(cleaned_crime, aes(x = long, y = lat)) +
stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", contour = TRUE) +
scale_fill_viridis_c() +
labs(title = "2D Density Plot: Crimes", x = "Longitude", y = "Latitude") +
theme_minimal() +
coord_fixed()
#Bar plot to check top crime category and outcomes
crime_outcome <- cleaned_crime %>%
group_by(category, outcome_status) %>%
summarise(count = n(), .groups = 'drop')
top_categories <- crime_outcome %>%
group_by(category) %>%
summarise(total_count = sum(count)) %>%
arrange(desc(total_count)) %>%
head(4) %>%
pull(category)
crime_outcome_top <- crime_outcome %>%
filter(category %in% top_categories)
#Shortening the labels because they are too big for x-axis
crime_outcome_top <- crime_outcome_top %>%
mutate(outcome_shortened = case_when(
outcome_status == "No Information available" ~ "NI",
outcome_status == "Investigation complete; no suspect identified" ~ "IC",
outcome_status == "Court result unavailable" ~ "CRU",
outcome_status == "Status update unavailable" ~ "SUU",
outcome_status == "Unable to prosecute suspect" ~ "UP",
outcome_status == "Local resolution" ~ "LR",
outcome_status == "Further action is not in the public interest" ~ "NAct",
outcome_status == "Action to be taken by another organisation" ~ "AO",
outcome_status == "Offender given a caution" ~ "OC",
outcome_status == "Awaiting court outcome" ~ "ACO",
outcome_status == "Formal action is not in the public interest" ~ "NoA",
outcome_status == "Suspect charged as part of another case" ~ "CAC",
outcome_status == "Further investigation is not in the public interest" ~ "NFI",
outcome_status == "Under investigation" ~ "UI",
TRUE ~ outcome_status
))
ggplot(crime_outcome_top, aes(x = reorder(outcome_shortened, -count), y = count, fill = outcome_shortened)) +
geom_col() +
facet_wrap(~category, scales = "free_x") +
labs(title = "Crime Outcomes according to Top 4 Categories", x = "Outcome", y = "Count") +
theme_minimal() +
#theme_minimal() resets axis text, so blank the x labels after applying it
theme(axis.text.x = element_blank()) +
scale_fill_manual(values = c(
"NI" = "grey",
"IC" = "lightblue",
"CRU" = "lightgreen",
"SUU" = "lightpink",
"UP" = "lightyellow",
"LR" = "lightcoral",
"NAct" = "lightgray",
"AO" = "lightsalmon",
"OC" = "lightseagreen",
"ACO" = "lightsteelblue",
"NoA" = "lightgoldenrodyellow",
"CAC" = "violet",
"NFI" = "darkorange",
"UI" = "salmon"
))
The interactive box plot lets us check the outcomes recorded for a specific crime category. Hovering over the plot shows the exact outcomes of that crime after investigation.
#Box plot to display outcomes for crimes:
category_colors <- c(
"anti-social-behaviour" = "red",
"bicycle-theft" = "skyblue",
"burglary" = "lightgreen",
"criminal-damage-arson" = "purple",
"drugs" = "orange",
"other-theft" = "yellow",
"possession-of-weapons" = "pink",
"public-order" = "cyan",
"robbery" = "brown",
"shoplifting" = "magenta",
"theft-from-the-person" = "lightblue",
"vehicle-crime" = "darkgreen",
"violent-crime" = "darkred",
"other-crime" = "blue"
)
plot_ly(cleaned_crime, y = ~outcome_status, type = "box", color = ~category, colors = category_colors) %>%
layout(title = "Box Plot: Outcomes by Crime Category")
Checking trends by smoothing:
We further analyse the monthly trend of violent crimes. Smoothing plays a key role in trend analysis: it reduces the noise in the fluctuating monthly counts and reveals the underlying trend, giving a clearer picture for interpretation and future analysis. The plot shows both the smoothed trend line and the actual value for each month. Bars that fall short of the line indicate months below the estimated trend, bars that cross the line indicate months above it, and some months, such as March, October and December, sit close to the predicted trend.
#Checking the monthly trends of violent crimes(using smoothing)
violent_crimes <- cleaned_crime %>% filter(category == "violent-crime")
violent_crimes$month <- floor_date(violent_crimes$date, "month")
violent_crimes_monthly <- violent_crimes %>%
group_by(month) %>%
summarise(crime_count = n(), .groups = 'drop')
ggplot(violent_crimes_monthly, aes(x = month, y = crime_count)) +
geom_bar(stat = "identity", fill = "orange", width = 25) +
geom_smooth(method = "auto", se = FALSE, color = "black") +
labs(title = "Monthly Trend in Violent Crimes", x = "Month", y = "Crime Count") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
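As the console message notes, geom_smooth() chose loess here. The same kind of fit can be sketched directly with stats::loess(); the numbers below are illustrative synthetic counts, not the violent-crime data:

```r
# Illustrative loess fit on synthetic monthly counts
# (made-up numbers; not the violent-crime figures).
set.seed(1)
months <- 1:12
counts <- 200 + 5 * months + rnorm(12, sd = 10)  # noisy upward trend

fit   <- loess(counts ~ months)  # local polynomial regression, default span
trend <- predict(fit)            # smoothed value for each month

# The residuals are the month-to-month noise the smoother strips out:
noise_sd <- sd(counts - trend)
```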
Checking crimes with respect to the weather (temp) information provided:
According to the time-series plot, periods where the temperature spiked and then dropped coincide with an increase in recorded crimes, whereas periods where the temperature is relatively stable show fewer crimes. This suggests that drastic changes in the weather may be one factor behind increased crime, although correlation alone does not establish causation. All of this can be read from the time-series plot, a type of plot that shows how a quantity changes over time.
#Doing a time-series plot for crime count with respect to temperature
temp_for_crime <- cleaned_crime %>%
mutate(date = as.Date(date)) %>%
group_by(date) %>%
summarise(crime_count = n()) %>%
left_join(cleaned_temp, by = c("date" = "Date"))
scaling_factor = 10
ggplot(temp_for_crime, aes(x = date)) +
geom_line(aes(y = crime_count), color = "blue", linewidth = 1) +
geom_line(aes(y = TemperatureCAvg * scaling_factor), color = "red", linetype = "dashed", linewidth = 1) +
scale_y_continuous(
name = "Crime Count",
sec.axis = sec_axis(~ . / scaling_factor, name = "Average Temperature (°C)")
) +
labs(title = "Time-Series of Crime Count and Temperature", x = "Date") +
theme_minimal()
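A note on the dual axis above: the temperature series is multiplied by a scaling factor so it can share the crime-count scale, and sec_axis() applies the inverse transform so the right-hand axis still reads in true degrees. A minimal sketch of that round trip:

```r
# Round trip behind the sec_axis() call above: values are plotted as
# temp * scaling_factor, and the secondary axis divides by the same
# factor so its labels show the true degrees Celsius.
scaling_factor <- 10
temp_c   <- 15.2
plotted  <- temp_c * scaling_factor   # position on the crime-count axis
labelled <- plotted / scaling_factor  # value shown on the secondary axis
stopifnot(isTRUE(all.equal(labelled, temp_c)))
```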
Correlation analysis:
In the correlation plot, values range from +1 to -1: +1 indicates perfect positive correlation (as one value, say temperature, increases, the other, crime, increases with it), -1 indicates perfect negative correlation (the two are inversely related), and 0 means no linear correlation. From the seasonal bar plot, we can see that crime counts are highest in Summer, although violent crime counts are higher in Winter.
#Checking if there is any correlation between various crimes and the weather
crime_tempData_correlation <- temp_for_crime %>%
dplyr::select(
Crimes = crime_count,
Avg_temp = TemperatureCAvg,
Max_temp = TemperatureCMax,
Min_temp = TemperatureCMin,
Hour_avg = HrAvg
) %>%
cor(use = "complete.obs")
ggcorrplot(crime_tempData_correlation, lab = TRUE)
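The endpoints of the correlation scale can be illustrated on toy vectors (illustrative numbers only, unrelated to the datasets):

```r
# Toy illustration of the correlation scale used above.
x <- c(1, 2, 3, 4, 5)
cor(x,  2 * x)  # perfectly proportional, so the coefficient is +1
cor(x, -2 * x)  # perfectly inverse, so the coefficient is -1
```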
#Analysing crimes by seasons:
cleaned_crime$season <- case_when(
month(cleaned_crime$date) %in% 3:5 ~ "Spring",
month(cleaned_crime$date) %in% 6:8 ~ "Summer",
month(cleaned_crime$date) %in% 9:11 ~ "Autumn",
TRUE ~ "Winter"
)
#Summarising crime counts by season and category
crime_by_season <- cleaned_crime %>%
group_by(season, category) %>%
summarise(crime_count = n(), .groups = "drop")
#Define custom colors for seasons (optional; the plot below uses a viridis fill instead)
season_colors <- c("Spring" = "green",
"Summer" = "orange",
"Autumn" = "darkorange",
"Winter" = "blue")
#plotting stacked bar chart
ggplot(crime_by_season, aes(x = season, y = crime_count, fill = category)) +
geom_col(position = "stack") +
scale_fill_viridis_d(option = "C", name = "Crime Category") + # Distinct colors for categories
scale_x_discrete(drop = FALSE) + # Ensure all seasons are shown
labs(title = "Crime Distribution by Season & Category", x = "Season", y = "Number of Crimes") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
legend.position = "right"
)
Heatmap to see crime hotspots:
A heatmap on a map shows where, in our case, crime is occurring with greater intensity (higher counts) using a colour gradient. Places with a reddish-orange hue indicate more crimes in that area, while pale blue-green colours indicate fewer. Because we use the leaflet package, our heatmap is interactive: we can zoom, click, and explore a specific region to examine the data in more depth.
crime_heatmap <- cleaned_crime %>% filter(!is.na(lat) & !is.na(long))
leaflet(data = crime_heatmap) %>% addTiles() %>% addHeatmap(lng = ~long, lat = ~lat, radius = 10)
Leaflet for violent crimes:
Similar to the leaflet heatmap, here we map the individual incidents of the highest-count category. Observing this interactive map, we can conclude that most violent crimes are committed in the city centre rather than its surrounding areas.
violent_crimes <- cleaned_crime %>% filter(category == "violent-crime")
violent_map <- violent_crimes[!is.na(violent_crimes$lat) & !is.na(violent_crimes$long), ]
leaflet(data = violent_map) %>%
addTiles() %>%
addCircleMarkers(lng = ~long, lat = ~lat,
radius = 3, color = "blue", stroke = FALSE, fillOpacity = 0.4,
popup = ~paste(category, outcome_status))